Add GitHub Action for Nextclade annotations #158

huddlej · 2024-03-25T23:58:41Z

Description of proposed changes

Adds rules, config, and GitHub Action file to support running Nextclade on all available HA and NA sequences for H1N1pdm, H3N2, and Vic.

The new Snakemake logic lives in a custom build config named nextclade, which skips most of the standard workflow build logic. It uses only the "download from S3" rules plus its own custom rules to run Nextclade for the lineages and segments defined in the build config. Like the flu_frequencies workflow, this custom logic runs Nextclade with the default dataset per lineage and segment, then uploads the results to S3. Once we have merged this PR, we should be able to automatically run Nextclade with each ingest of new data and run flu_frequencies from the resulting files on S3.

This PR includes a couple of minor changes to other parts of the standard workflow to allow the Nextclade build config YAML to be as simple as possible and also to allow all workflows to download the parsed sequences and metadata from S3 instead of downloading the raw sequences and parsing them again locally.

To run the Nextclade workflow, use the following command:

nextstrain build . upload_all_nextclade_files --configfile profiles/nextclade.yaml

This uploads Nextclade annotations and alignment files to S3 per lineage and segment, for example seasonal-flu/vic/ha/nextclade.tsv.xz and seasonal-flu/vic/ha/aligned.fasta.xz.
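As a rough illustration of that per-lineage, per-segment layout, here is a minimal sketch that enumerates the expected S3 keys. The lineage keys `h1n1pdm` and `h3n2` and the helper name `nextclade_s3_keys` are assumptions for illustration; only the `seasonal-flu/vic/ha/...` paths appear in this PR.

```python
from itertools import product

# Assumed lineage keys; only "vic" appears verbatim in this PR's example paths.
LINEAGES = ["h1n1pdm", "h3n2", "vic"]
SEGMENTS = ["ha", "na"]
FILENAMES = ["nextclade.tsv.xz", "aligned.fasta.xz"]


def nextclade_s3_keys(prefix="seasonal-flu"):
    """Return one S3 key per lineage, segment, and output file (hypothetical helper)."""
    return [
        f"{prefix}/{lineage}/{segment}/{filename}"
        for lineage, segment, filename in product(LINEAGES, SEGMENTS, FILENAMES)
    ]


if __name__ == "__main__":
    for key in nextclade_s3_keys():
        print(key)
```

With three lineages, two segments, and two files each, the sketch yields twelve keys.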

This PR does not produce a merged metadata and Nextclade annotations file like the one used in the flu_frequencies workflow or the analogous ncov metadata files with Nextclade annotations included. I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.
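To make the duplication concern concrete, here is a minimal sketch of joining one lineage-level metadata table with per-segment Nextclade annotations. All records are hypothetical placeholders, and the column names (`strain` for metadata, `seqName` and `clade` for Nextclade output) are assumptions about the file layout rather than something this PR specifies.

```python
# Hypothetical records for illustration; the real files are TSVs with many more columns.
metadata = [
    {"strain": "A/Example/1/2024", "date": "2024-01-15", "region": "Europe"},
    {"strain": "A/Example/2/2024", "date": "2024-02-02", "region": "Asia"},
]

# One annotation table per segment, keyed here by an assumed sequence name column.
nextclade_by_segment = {
    "ha": [
        {"seqName": "A/Example/1/2024", "clade": "ha-clade-a"},
        {"seqName": "A/Example/2/2024", "clade": "ha-clade-b"},
    ],
    "na": [
        {"seqName": "A/Example/1/2024", "clade": "na-clade-a"},
    ],
}


def merge_metadata_with_nextclade(metadata, annotations):
    """Left-join one segment's annotations onto the lineage-level metadata."""
    clade_by_name = {row["seqName"]: row["clade"] for row in annotations}
    return [
        {**record, "clade": clade_by_name.get(record["strain"], "")}
        for record in metadata
    ]


merged_by_segment = {
    segment: merge_metadata_with_nextclade(metadata, annotations)
    for segment, annotations in nextclade_by_segment.items()
}

# Every metadata column is repeated once per segment; only "clade" differs.
for segment, rows in merged_by_segment.items():
    for row in rows:
        print(segment, row)
```

Each merged table carries a full copy of the metadata columns, which is the duplication across segments described above.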

~~Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.~~

Related issue(s)

Related to #144

Checklist

Checks pass
GitHub Action runs and deploys data to S3 as expected

Adds rules, config, and GitHub Action file to support running Nextclade on all available sequences. Not yet tested.

Remove unnecessary configuration parameters from the Nextclade build config and update the workflow to allow these parameters to be missing. Since Snakemake evaluates the Python code in each rule's inputs, outputs, and params, rules that we don't plan to run in the workflow can produce key errors when their config parameters are not defined.
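The failure mode described in this commit message can be sketched in plain Python, using an ordinary dictionary as a stand-in for Snakemake's `config`. The key name `min_length` and both helper functions are hypothetical and are not taken from the workflow.

```python
# Stand-in for a minimal build config that omits parameters used by rules we never run.
config = {"lineages": ["h3n2"], "segments": ["ha", "na"]}


def min_length_strict(config):
    """Direct lookup: raises KeyError when the key is absent, even for unused rules."""
    return config["min_length"]


def min_length_tolerant(config, default=None):
    """Lookup with a fallback, so a missing key does not break workflow parsing."""
    return config.get("min_length", default)


if __name__ == "__main__":
    print(min_length_tolerant(config, default=0))
    try:
        min_length_strict(config)
    except KeyError as error:
        print(f"missing config key: {error}")
```

Using a fallback for optional keys is one way to let a small config skip parameters that only unused rules need; the actual change in this PR may differ.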

Simplifies the logic to get Nextclade datasets by following the same pattern as the flu_frequencies workflow [1] where we grab the default dataset for a given lineage and segment instead of specifying a reference name. The "broad" and more recent references for H3N2 HA, for example, are not too different from each other, but the Nextclade annotations for the former are far more verbose than for the latter. We also want the files produced by this workflow to plug directly into the flu_frequencies workflow logic, so it is best to use the same approach here.

[1] https://github.com/nextstrain/flu_frequencies/blob/6e4298fac3361f4a6751d85bcb963064dbb9eee1/Snakefile#L95
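A minimal sketch of what "grab the default dataset for a given lineage and segment" could look like when building the download command. The `flu_{lineage}_{segment}` naming pattern and the helper name are assumptions and have not been checked against the linked Snakefile; the command names only the lineage and segment and selects no specific reference.

```python
def nextclade_dataset_get_command(lineage, segment, output_dir):
    """Build a `nextclade dataset get` command for one lineage and segment (hypothetical helper)."""
    # Assumed dataset naming pattern; verify against the datasets Nextclade actually lists.
    dataset_name = f"flu_{lineage}_{segment}"
    return [
        "nextclade", "dataset", "get",
        "--name", dataset_name,
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    command = nextclade_dataset_get_command("h3n2", "ha", "nextclade_dataset/h3n2_ha")
    # Print rather than execute, since this is only a sketch.
    print(" ".join(command))
```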

Adds NA to list of segments, since we want to know the subclade annotations for NA as well as HA and use these data to estimate frequencies.

joverlee521 · 2024-03-26T18:23:11Z

Not doing an in-depth review, just reading over this PR because I'm interested in how you handled the different segments.

> I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.

It'd be good to discuss what's the easiest for downstream users here (granted that may currently only be the flu-frequencies workflow?).

> Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.

It's possible to test a new GitHub Action with either Tom's or Victor's work-around

Add temporary trigger for Nextclade workflow on PR event. This should trigger the workflow when I push the update to the PR. If it works, I should drop this commit again.

huddlej · 2024-03-26T21:28:28Z

> It'd be good to discuss what's the easiest for downstream users here

Totally agree, @joverlee521! @plsteinberg is one such downstream user who has to manually join the metadata and HA Nextclade annotations for their project, so we might have an idea of how that could be improved based on their feedback. 😄

> It's possible to test a new GitHub Action with either Tom's or Victor's work-around

Gah, I knew this existed but forgot the mechanics. Thank you for the reminder! Trying with Victor's approach now.

The workflow ran successfully, so removing this trigger.

Reduce threads requested for Nextclade runs from 16 to 12 so we can run 3 Nextclade jobs at once (one per lineage) on a 36-core instance of AWS Batch.
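The arithmetic behind this change is a simple integer division. The function name below is illustrative; the core and thread counts come from the commit message.

```python
def concurrent_jobs(total_cores, threads_per_job):
    """Number of jobs that fit on an instance when each job reserves a fixed thread count."""
    return total_cores // threads_per_job


if __name__ == "__main__":
    # With 16 threads per job, only 2 jobs fit on 36 cores and 4 cores sit idle.
    print(concurrent_jobs(36, 16), 36 % 16)  # 2 4
    # With 12 threads per job, all 3 lineages run at once with no idle cores.
    print(concurrent_jobs(36, 12), 36 % 12)  # 3 0
```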

huddlej · 2024-04-01T17:03:22Z

I'm going to merge this now, so I can start running the GitHub Action after our weekly ingests. We should continue to discuss this implementation in the future, though, since there is likely still room to improve the user experience.

huddlej added 4 commits March 25, 2024 16:56

Prototype GitHub Action for Nextclade annotations

7ee96b5

Run Nextclade for NA

3ba4c13

huddlej marked this pull request as ready for review March 26, 2024 16:46

huddlej requested a review from rneher March 26, 2024 16:54

Trigger Nextclade on PR

b4464b0

huddlej added 2 commits March 26, 2024 14:51

Remove pull request trigger

4adf731

Set Nextclade threads to a factor of 36

94cdfe8

huddlej changed the title ~~Prototype GitHub Action for Nextclade annotations~~ Add GitHub Action for Nextclade annotations Mar 30, 2024

huddlej merged commit a162342 into master Apr 1, 2024
3 checks passed

huddlej deleted the add-nextclade-github-action branch April 1, 2024 17:03

joverlee521 mentioned this pull request May 13, 2024

Upload trigger matrix #165

Merged

2 tasks
